
Author Search Result

[Author] Kazuaki MURAKAMI (33 hits)

Results 21-33 of 33

  • Evaluating DRAM Refresh Architectures for Merged DRAM/Logic LSIs

    Taku OHSAWA  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER
    Vol: E81-C No:9, Page(s): 1455-1462

    In merged DRAM/logic LSIs, the number of DRAM refreshes must be reduced because the logic portion on the same chip raises heat dissipation. To overcome this problem, we propose several DRAM refresh architectures. The basic idea is to eliminate unnecessary DRAM refreshes; in addition, we propose a method for reducing the number of refreshes by relocating data. To evaluate these architectures and the relocation method, we estimated the DRAM refresh count while executing benchmark programs under several models, each simulating a different combination of them. With the most effective combination, we obtained a reduction of more than 80% over a conventional DRAM refresh architecture for most of the benchmark programs. Even when normal DRAM accesses are taken into account, we still obtained a reduction of more than 50% for several benchmarks.
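
    A minimal sketch of the selective-refresh idea follows (an illustration, not the paper's implementation): refresh only the rows that hold live data, and pack live data into fewer rows so that more rows can skip refresh entirely. The row size and live-data map are invented.

        import math

        ROW_BYTES = 2048                      # bytes per DRAM row (assumed)

        def refreshes_per_period(live_rows):
            """Selective refresh: only rows holding live data are refreshed."""
            return len(live_rows)

        def rows_after_relocation(live_bytes_by_row):
            """Ideal relocation: pack all live data into as few rows as possible."""
            return math.ceil(sum(live_bytes_by_row.values()) / ROW_BYTES)

        # Example: 1024 rows, each holding only 256 live bytes.
        live = {row: 256 for row in range(1024)}
        before = refreshes_per_period(live)          # 1024 refreshes per period
        after = rows_after_relocation(live)          # 128 rows after packing
        print(f"refresh reduction: {100 * (1 - after / before):.1f}%")  # 87.5%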

  • Performance Models for MPI Collective Communications with Network Contention

    Hyacinthe NZIGOU MAMADOU  Takeshi NANRI  Kazuaki MURAKAMI  

     
    PAPER-Network
    Vol: E91-B No:4, Page(s): 1015-1024

    This paper presents a novel approach to estimating the performance of MPI collective communications. Our objective is to help researchers make appropriate decisions about their message-passing applications. For each collective communication, we apply the standard LogGP and P-LogP point-to-point models. The resulting models are compared with empirical data to identify the one most suitable for characterizing the performance of collective operations. For communications on large clusters with large messages, network contention can significantly affect performance. Hence, to narrow the gap between predicted and measured runtimes, contention is also modeled, using a queuing-theory analysis, and incorporated into the total performance estimate. Experiments performed on a cluster of 64 processors interconnected by a Gigabit Ethernet network show encouraging results. For any collective operation, given a number of processors and a range of message sizes, at least one model predicts the performance precisely. We achieved a gap between predicted and measured runtimes of around 15%; by handling contention, we reduced the relative gap by around 80%.
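
    As a concrete illustration of fitting a point-to-point model to a collective, the following is a hedged sketch of a LogGP estimate for a binomial-tree broadcast (the paper additionally corrects such estimates for contention via queuing analysis). The parameter values are invented, not measurements.

        import math

        def loggp_msg(m, L, o, G):
            """LogGP time for one m-byte message: latency, two overheads,
            and a per-byte gap."""
            return L + 2 * o + (m - 1) * G

        def bcast_binomial(P, m, L, o, G):
            """Binomial-tree broadcast: ceil(log2 P) sequential rounds."""
            return math.ceil(math.log2(P)) * loggp_msg(m, L, o, G)

        # 64 processes, 1 MiB message, made-up microsecond-scale parameters.
        print(f"{bcast_binomial(P=64, m=1 << 20, L=50.0, o=5.0, G=0.009):.0f} us")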

  • A High-Performance/Low-Power On-Chip Memory-Path Architecture with Variable Cache-Line Size

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER
    Vol: E83-C No:11, Page(s): 1716-1723

    This paper proposes an on-chip memory-path architecture employing a dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to the replaced cache-line size produces a significant energy reduction. In our simulations, the proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (energy-delay) product by more than 75% over a conventional memory-path model.
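
    The line-size decision could look like the sketch below, which widens or narrows the next line size based on the spatial locality observed at eviction time. The selectable sizes and the exact policy are assumptions, not the paper's mechanism.

        SIZES = (32, 64, 128)          # selectable line sizes in bytes (assumed)

        class LineSizeSelector:
            """Widen the line under good spatial locality, narrow it under
            poor locality."""
            def __init__(self):
                self.level = len(SIZES) - 1        # start with the largest size

            def on_eviction(self, subblocks_touched, subblocks_total):
                if subblocks_touched == subblocks_total:     # whole line used
                    self.level = min(self.level + 1, len(SIZES) - 1)
                elif subblocks_touched == 1:                 # one subblock used
                    self.level = max(self.level - 1, 0)

            def next_line_size(self):
                return SIZES[self.level]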

  • A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

    Koji INOUE  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER
    Vol: E83-C No:2, Page(s): 186-194

    This paper proposes a new approach to achieving high performance and low energy consumption in set-associative caches. The proposed cache, called the way-predicting set-associative cache, speculatively selects a single way likely to contain the data desired by the processor from the set designated by a memory address, before starting a normal cache access. By accessing only the predicted way, instead of all the ways in a set, energy consumption is reduced. For the way-predicting cache to perform well, the accuracy of way prediction is crucial. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
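
    A minimal sketch of MRU-based way prediction, under assumed structure sizes: probe only the predicted way first, and fall back to probing the remaining ways, with an extra cycle and full energy, on a misprediction.

        WAYS = 4

        class WayPredictingCache:
            def __init__(self, num_sets):
                self.tags = [[None] * WAYS for _ in range(num_sets)]
                self.mru = [0] * num_sets    # predicted (MRU) way per set

            def access(self, index, tag):
                way = self.mru[index]
                if self.tags[index][way] == tag:
                    return "fast hit"        # one way probed: low energy
                for w in range(WAYS):        # fallback: probe the other ways
                    if w != way and self.tags[index][w] == tag:
                        self.mru[index] = w  # remember the new MRU way
                        return "slow hit"    # extra cycle, all ways activated
                return "miss"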

  • Architectural-Level Soft-Error Modeling for Estimating Reliability of Computer Systems

    Makoto SUGIHARA  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER-VLSI Design Technology
    Vol: E90-C No:10, Page(s): 1983-1991

    This paper proposes a soft-error model for accurately estimating the reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies, at the cycle-accurate instruction set simulation (ISS) level, which parts of the memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system. The model can estimate the reliability of a computer system with several levels of memory hierarchy and find which memory module in the system is vulnerable. Such reliability estimation helps system designers apply reliable design techniques to the vulnerable parts of their design. The experimental results show that the soft-error model achieves more accurate reliability estimation than conventional approaches, and demonstrate that the reliability of computer systems depends not only on the soft error rates (SERs) of the memories but also on the behavior of the software running on them.
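
    The counting idea behind such a model might be sketched as follows (an assumption in the spirit of the abstract, not the paper's equations): an SEU is critical only if it strikes a bit during an interval in which that bit still influences execution, i.e. between a write and its last read. The intervals and rate below are invented.

        def vulnerable_bit_cycles(intervals, bits_per_word):
            """intervals: (write_cycle, last_read_cycle) pairs, one per live
            word, gathered from cycle-accurate ISS."""
            return bits_per_word * sum(last - first for first, last in intervals)

        def expected_failures(intervals, bits_per_word, seu_rate):
            # failures ~ vulnerable bit-cycles x per-bit, per-cycle upset rate
            return vulnerable_bit_cycles(intervals, bits_per_word) * seu_rate

        # Toy example: three words live for different spans of a run.
        spans = [(0, 400_000), (100_000, 120_000), (500_000, 999_999)]
        print(expected_failures(spans, bits_per_word=32, seu_rate=1e-15))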

  • Tradeoffs in Processor Design for Superscalar Architectures

    Kazuaki MURAKAMI  Morihiro KUGA  Oubong GWUN  Shinji TOMITA  

     
    PAPER-Computer Systems
    Vol: E74-D No:11, Page(s): 3883-3893

    Superscalar processors can improve uniprocessor performance beyond RISC performance by exploiting spatial instruction-level parallelism. Superscalar processor design presents more opportunities for tradeoffs than conventional RISC design, and processors must be carefully designed and implemented to utilize the resources the superscalar approach adds. This paper examines various aspects of superscalar processors and discusses their design features and tradeoffs. The specific aspects examined include: instruction fetch boundary, instruction-cache line crossing, branch prediction, data-hazard resolution, control-hazard resolution, and precise or imprecise interrupts. Using a superscalar simulator that models a DDU (Dynamically-hazard-resolved, Dynamic-code-scheduled, Uniform) superscalar architecture called SIMP (Single Instruction stream/Multiple instruction Pipelining), the paper evaluates many different SIMP hardware organizations. It concludes that a superscalar processor can increase performance with five major hardware features: instruction aligning, branch prediction with a branch-target buffer, code scheduling, speculative execution with conditional mode, and imprecise interrupts. However, it argues that the first three functions should be performed by compilers rather than by hardware.
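
    Of the five features, branch prediction with a branch-target buffer is the simplest to sketch. The table size and 2-bit-counter policy below are illustrative, not SIMP's actual design.

        BTB_ENTRIES = 512

        class BTB:
            def __init__(self):
                self.table = [None] * BTB_ENTRIES  # (tag, target, 2-bit ctr)

            def predict(self, pc):
                entry = self.table[pc % BTB_ENTRIES]
                if entry and entry[0] == pc and entry[2] >= 2:
                    return entry[1]          # predict taken: redirect fetch
                return pc + 4                # predict fall-through

            def update(self, pc, taken, target):
                i = pc % BTB_ENTRIES
                tag, tgt, ctr = self.table[i] or (pc, target, 1)
                if tag != pc:                # replace on tag mismatch
                    tag, tgt, ctr = pc, target, 1
                ctr = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
                self.table[i] = (tag, target if taken else tgt, ctr)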

  • Relaxing Constraints due to Data and Control Dependences

    Katsuhiko METSUGI  Kazuaki MURAKAMI  

     
    PAPER-Computer Systems
    Vol: E86-D No:5, Page(s): 920-928

    TLSP (Thread-Level Speculative Parallel processing) is an emerging processor architecture. The parallelism of a program executed on this architecture is governed by the combination of techniques used to relax data dependences. In this paper, we evaluate the limits of parallelism of the TLSP architecture using abstract machine models. We obtain three major results. First, when each data-dependence-relaxing technique is used alone, "renaming" has the largest effect on the TLSP architecture. Second, combining "memory disambiguation" and "renaming" yields huge parallelism. Third, using "value prediction" together with the other techniques provides consistent additional gains.
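
    A toy dataflow-limit experiment in the spirit of such abstract machine models is sketched below: each instruction issues at the earliest cycle its dependences allow, and renaming removes the WAR/WAW constraints so that only true (RAW) dependences remain. The four-instruction trace is invented.

        trace = [("r1", []), ("r2", ["r1"]), ("r1", []), ("r3", ["r1"])]

        def ipc(trace, renaming):
            ready, last_write, last_read, issue = {}, {}, {}, []
            for dst, srcs in trace:
                t = max([ready.get(s, 0) for s in srcs] + [0])  # RAW (true)
                if not renaming:                                # WAW and WAR
                    t = max(t, last_write.get(dst, 0), last_read.get(dst, 0))
                for s in srcs:
                    last_read[s] = max(last_read.get(s, 0), t + 1)
                ready[dst] = last_write[dst] = t + 1
                issue.append(t)
            return len(trace) / (max(issue) + 1)    # instructions per cycle

        print(ipc(trace, renaming=False), ipc(trace, renaming=True))  # 1.0 2.0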

  • Trends in High-Performance, Low-Power Processor Architectures

    Kazuaki MURAKAMI  Hidetaka MAGOSHI  

     
    PAPER
    Vol: E84-C No:2, Page(s): 131-138

    This paper briefly surveys the architectural technologies of recent and forthcoming high-performance, low-power processors for improving performance and power/energy consumption simultaneously. Achieving both high performance and low power at the same time imposes many challenges on processor design, and therefore offers many opportunities for devising new technologies. The paper also provides some insights into future technology directions.

  • Reducing On-Chip DRAM Energy via Data Transfer Size Optimization

    Takatsugu ONO  Koji INOUE  Kazuaki MURAKAMI  Kenji YOSHIDA  

     
    PAPER
    Vol: E92-C No:4, Page(s): 433-443

    This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. Advanced integration technology provides high bandwidth between logic and DRAM; System-in-Silicon is one architectural framework that realizes this bandwidth, mounting an ASIC and a specific DRAM onto a silicon interposer, with each chip connected to the interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The DRAM needs a small cache memory to improve performance, and we exploit this cache to reduce the DRAM energy consumption. During program execution, the cache line size that produces the lowest miss ratio varies, because the amount of spatial locality of memory references changes. A large cache line size gives a prefetching effect, but consumes more DRAM energy than a small one because more banks are accessed. The SC-VLS cache can change its line size to an adequate one at runtime with small area and power overheads. We analyze the adequate line sizes and insert line-size-change instructions at the beginning of each function of a target program before executing it. In our evaluation, the SC-VLS cache reduces DRAM energy consumption by up to 88% compared to a conventional cache with fixed 256 B lines.
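
    The compile-time step might be sketched as follows: profile each function under every selectable line size, then plant a line-size-change directive at each function entry choosing the size with the lowest miss ratio. Function names and numbers are invented.

        def choose_line_sizes(miss_ratio):
            """miss_ratio: {function: {line_size: miss ratio}} from profiling.
            Returns the line size to set at each function's entry."""
            return {fn: min(by_size, key=by_size.get)
                    for fn, by_size in miss_ratio.items()}

        profiled = {                  # invented profiling results
            "fft":   {32: 0.09, 64: 0.06, 128: 0.04, 256: 0.05},
            "qsort": {32: 0.03, 64: 0.04, 128: 0.07, 256: 0.12},
        }
        print(choose_line_sizes(profiled))    # {'fft': 128, 'qsort': 32}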

  • Identifying Processor Bottlenecks in Virtual Machine Based Execution of Java Bytecode

    Pradeep RAO  Kazuaki MURAKAMI  

     
    PAPER
    Vol: E92-C No:10, Page(s): 1265-1275

    Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of various processor design decisions on Java performance. We attribute this lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities of executing Java bytecode on a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the impact of various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space of 35 factors (32 billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and ITLB design parameters. Furthermore, they enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.
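
    The screening rests on a standard property of two-level experimental designs: a factor's main effect is the mean response at its high setting minus the mean at its low setting. The sketch below shows this on a toy full-factorial design, with three invented factors standing in for the paper's 35.

        FACTORS = 3
        X = [[1 if (run >> j) & 1 else -1 for j in range(FACTORS)]
             for run in range(2 ** FACTORS)]              # full 2^3 design
        y = [10 + 3 * x0 - 1.5 * x2 for x0, _, x2 in X]   # toy "runtime"

        def main_effect(j):
            hi = [yi for row, yi in zip(X, y) if row[j] == 1]
            lo = [yi for row, yi in zip(X, y) if row[j] == -1]
            return sum(hi) / len(hi) - sum(lo) / len(lo)

        for j in range(FACTORS):
            print(f"factor {j}: effect {main_effect(j):+.1f}")  # +6.0 +0.0 -3.0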

  • A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

    Mariko SAKAMOTO  Akira KATSUNO  Go SUGIZAKI  Toshio YOSHIDA  Aiichiro INOUE  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER-VLSI Architecture for Communication/Server Systems
    Vol: E90-C No:10, Page(s): 1972-1982

    Broadcast and synchronization techniques are used for cache coherence control in conventional large-scale snoop-based SMP systems, and the synchronization penalty is directly proportional to system size. Meanwhile, advances in LSI technology now allow a memory controller to be placed on the CPU die, drastically reducing the latency of accesses to directly linked memory. Building an enterprise server system from such CPUs offers an opportunity for higher performance, but because the synchronization penalty is incurred on every cache miss, the coherence method must be improved to receive the full benefit. In this paper, we present a coherence directory organization suited to DSM enterprise server systems. Directory-based methods were originally adopted in high-performance computing systems because they scale far better than snoop-based methods; their major problems are directory capacity misses and long directory access latency. The relaxed scalability requirements of enterprise servers, together with advanced LSI technology, let us solve both problems by implementing a full-bit-vector map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by the proposed directory can surpass a snoop-based system in performance on an online transaction processing (OLTP) workload, even without data-localization optimization.
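
    A full-bit-vector directory entry can be sketched as follows: one presence bit per node plus a dirty flag, so a write triggers targeted invalidations instead of a broadcast. The node count and fields are illustrative, not the chip's implementation.

        NODES = 64

        class DirectoryEntry:
            def __init__(self):
                self.presence = 0       # bit i set => node i caches the line
                self.dirty = False

            def record_read(self, node):
                self.presence |= 1 << node

            def sharers(self):
                return [n for n in range(NODES) if self.presence >> n & 1]

            def record_write(self, node):
                targets = [n for n in self.sharers() if n != node]
                self.presence, self.dirty = 1 << node, True
                return targets          # invalidate only actual sharers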

  • Cell Library Development Methodology for Throughput Enhancement of Character Projection Equipment

    Makoto SUGIHARA  Taiga TAKATA  Kenta NAKAMURA  Ryoichi INANAMI  Hiroaki HAYASHI  Katsumi KISHIMOTO  Tetsuya HASEBE  Yukihiro KAWANO  Yusuke MATSUNAGA  Kazuaki MURAKAMI  Katsuya OKUMURA  

     
    PAPER-CAD
    Vol: E89-C No:3, Page(s): 377-383

    We propose a cell library development methodology for enhancing the throughput of character projection (CP) equipment. First, an ILP (Integer Linear Programming)-based cell selection is proposed for equipment on which both the CP and VSB (Variable Shaped Beam) methods are available, in order to minimize the number of electron beam (EB) shots, that is, the time to fabricate chips. Second, the influence of cell directions on the area and delay time of chips is examined; this examination helps reduce the number of EB shots with little deterioration of area and delay time, because unnecessary directions of cells can be removed. Finally, a case study reports the number of EB shots for several cases.
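
    With a single stencil-area budget, the shot-minimizing cell selection reduces to a 0/1 knapsack. The paper formulates it as an ILP; the dynamic-programming sketch below only illustrates the objective, on invented cell data.

        cells = [   # (name, instances, VSB shots per instance, stencil area)
            ("NAND2", 5000, 6, 1), ("DFF", 1200, 30, 4), ("INV", 8000, 4, 1),
        ]
        AREA = 4    # stencil area budget in arbitrary units (assumed)

        def min_shots(cells, area):
            base = sum(n * s for _, n, s, _ in cells)   # all-VSB shot count
            best = [0] * (area + 1)                     # max shots saved
            for _, n, s, a in cells:
                # putting this cell on the stencil saves n * (s - 1) shots
                for cap in range(area, a - 1, -1):
                    best[cap] = max(best[cap], best[cap - a] + n * (s - 1))
            return base - best[area]

        print(min_shots(cells, AREA))   # 49000 shots instead of 98000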

  • A Reconfigurable Data-Path Accelerator Based on Single Flux Quantum Circuits Open Access

    Hiroshi KATAOKA  Hiroaki HONDA  Farhad MEHDIPOUR  Nobuyuki YOSHIKAWA  Akira FUJIMAKI  Hiroyuki AKAIKE  Naofumi TAKAGI  Kazuaki MURAKAMI  

     
    INVITED PAPER
    Vol: E97-C No:3, Page(s): 141-148

    Single flux quantum (SFQ) circuits are expected to be a next-generation high-speed, low-power technology for logic circuits. CMOS, the dominant technology for conventional processors, cannot simply be replaced with SFQ technology because feedback loops and conditional branches are difficult to implement in SFQ circuits. This paper investigates the applicability of a reconfigurable data-path (RDP) accelerator based on SFQ circuits. The authors present detailed specifications of the SFQ-RDP architecture and compare its performance and power/performance ratio with those of a graphics processing unit (GPU). The results show up to 1600 times higher efficiency in terms of Flops/W (floating-point operations per second per watt) for some high-performance computing applications.
